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Due to its favorable computational efficiency time-dependent (TD) density functional theory 
(DFT) enables the prediction of electronic spectra in a high-throughput manner across chemical 
space. Its predictions, however, can be quite inaccurate. We resolve this issue with machine learning 
models trained on deviations of reference second-order approximate coupled-cluster singles and 
doubles (CC2) spectra from TDDFT counterparts, or even from DFT gap. We applied this approach 
to low-lying singlet-singlet vertical electronic spectra of over 20 thousand synthetically feasible small 
organic molecules with up to eight CONF atoms. The prediction errors decay monotonously as a 
function of training set size. For a training set of 10 thousand molecules, CC2 excitation energies can 
be reproduced to within ±0.1 eV for the remaining molecules. Analysis of our spectral database via 
chromophore counting suggests that even higher accuracies can be achieved. Based on the evidence 
collected, we discuss open challenges associated with data-driven modeling of high-lying spectra, 
and transition intensities. 


I. INTRODUCTION 

Quantum mechanical rational compound design strate¬ 
gies mm to model molecular valence electronic spectra 
holds great promise to narrow down the discovery of novel 
photonic, and optoelectronic devices. Potential applica¬ 
tions include the fabrication of low cost dye-sensitized so¬ 
lar cells |3], organic light emitting diodes [1], photosensi¬ 
tizers that are inert to environmental factors but useful in 
photodynamic therapy [S], and organic ultraviolet (UV) 
filters (aka sunscreens) in cosmetics [6]. For any given 
compound, the relevant prediction accuracy can readily 
be attained with an established excited state wavefunc- 
tion method. Successful studies include the quantitative 
description of solar cell materials [7], organic diodes [8], 
and even biologically relevant phenomena such as photo- 
induced dynamics of vitamins B2 [5], and D [TU]. For a 
robust forecast, depending on computational budget, one 
can also select a method according to the most appro¬ 
priate cost-to-performance ratio from the series of equa¬ 
tions of motion (EOM) or linear response (LR) variants 
of the coupled cluster (CC) theories CCS, CC2, CCSD, 
CCS and CCSDT. These methods scale from 0{N'^) to 
0{N^), where N is the number of orbitals [IT]. When 
increasing size, or number of molecules, the next viable 
compromise between accuracy and computational com¬ 
plexity is linear response time-dependent density func- 
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tional theory (LR-TDDFT) within the adiabatic approx¬ 
imation [El \T3\. TDDFT, commonly based on local 
or semi-local exchange correlation (XC) functionals, has 
been shown to yield qualitatively inaccurate predictions 
whenever the valence excitations involve charge-transfer 
(CT) [14], and the adiabatic approximation fails to accu¬ 
rately describe transitions with double excitation char¬ 
acter m- Such qualitative failure of TDDFT, hard to 
anticipate without visual inspection of molecular orbitals 
involved in the transitions, dramatically reduces its use¬ 
fulness for high-throughput screening of molecules with 
interesting electronic spectra. Application of CC meth¬ 
ods for large scale computation is already prohibitive 
even when considering just electronic ground state prop¬ 
erties of small sub-fractions of the known small molecule 
chemical universe, such as the GDB-17 with over 166 bil¬ 
lion organic molecules with no more than 17 atoms (not 
counting hydrogens). [TB] . 

For combinatorially and computationally hard prob¬ 
lems, such as navigating chemical space in quest of an 
optimal electronic spectra m, statistical inference from 
large volumes of data offers an appealing alternative to 
the conventional strategies of investing in ever more so¬ 
phisticated approximations, faster hardware, or more effi¬ 
cient programming. Statistical learning has already con¬ 
tributed to scientihc progress in biology [18] or climate 
research m- Inspired by the success of such efforts, sev¬ 
eral computational chemistry studies have recently made 
use of supervised machine learning (ML) models to in¬ 
fer quantum mechanical properties of query molecules 
from those of a set of example molecules, computed a 
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priori. Effectively, this amounts to modeling expecta¬ 
tion values calculated with approximate solutions to the 
electronic Schrodinger equation, most notably the en¬ 
ergy . By now, ML models have been shown to reach 
the highly coveted quantum chemical accuracy for many 
different ground-state molecular properties [2TH23]. As 
such, also quantum mechanical expectation values can 
be interpolated in chemical space m- Improvement of 
molecular models of chemical properties based on molec¬ 
ular similarity [Ml ES] are also related to this approach. 
These developments have also inspired studies on tran¬ 
sition state dividing surfaces [50] , orbital-free kinetic en¬ 
ergy density functionals m, electronic properties of crys¬ 
tals [28], transmission coefficients in nano-ribbon mod¬ 
els [29] ■ or densities of states in Anderson impurity mod¬ 
els [30]. More recently a single kernel has been intro¬ 
duced for the simultaneous modeling of multiple elec¬ 
tronic ground-state properties for training-sets comprised 
of up to 40,000 molecules m- Here, we report on our 
findings when trying to apply these ML methods to in¬ 
fer properties of molecules in their electronically excited 
states. More specifically, we discuss ML models which 
combine CC accuracy with DFT efficiency. 


II. METHODS 

A. A-ML model of excited states properties 

In Ref. [23], some of us introduced the A-ML Ansatz 
to estimate molecular ground-state properties from an 
expensive targetline theory, at the computational cost of 
an inexpensive baseline theory (B). The ML model for 
the quantitative prediction of molecular electronic spec¬ 
tra is built in analogy, using ML models of the devia¬ 
tion of TDDFT excited state properties from CC2 ref¬ 
erence numbers. We approximate an electronic static 
property, corresponding to the excited state of 
query molecule q at CC2 level of theory as the sum of 
baseline prediction and a linear combination of exponen¬ 
tially decaying functions in molecular similarity to train¬ 
ing molecules t, 

N 

t=l 

where N is the number of molecules in training set, a is 
the kernel width, and |dg — dt| corresponds to the Man¬ 
hattan (Li) norm between molecular descriptors d {vide 
infra). A previous study |22| benchmarked the perfor¬ 
mance of various norms in above equation when directly 
modeling atomization energies (no baseline), and found 
the Li norm to yield lowest cross-validated errors. The 
second term on the right side of Eq. [^ therefore models 
exclusively the error in baseline method B’s estimate of 


Pi when compared to CC2 for query molecule q 

N 

t=i 

In this study we have investigate two excited state prop¬ 
erties, Pi, namely excitation energy (with respect to the 
ground electronic state), Ei, and oscillator strength, fi, 
for the lowest two {i = 1, 2) singlet electronic states. 
Other excited states properties could have also been con¬ 
sidered with this generic approach. Due to their popular¬ 
ity we have selected for this study DFT and TDDFT as 
baseline B, and to CC2 as targetline. The CC2 method, 
with a triple-zeta basis set has been shown to predict 
experimental valence excitation energies with an MAE 
of 0.12 eV [32]. This error decreases slightly to 0.10 
eV, when CC2 is compared to the computationally more 
demanding method, CC3 [33] • To serve as a reference 
method in this ML study we therefore consider CC2 to 
represent the optimal compromise between sufficient ac¬ 
curacy and acceptable computational cost. To compare 
the impact of the baseline on the A-ML strategy we have 
considered various DFT [Sd] 133] baseline theories with in¬ 
creasing sophistication. However, any other combination 
of methods could have been chosen just as well. Our sim¬ 
plest non-zero baseline for pi = Ei, is the HOMO-LUMO 
gap of the ground-state computed using DFT-PBEO [36l - 
13^ . We also consider pi from LR-TDDFT [T31 SO] using 
the hybrid functionals PBEO, and CAM-B3LYP [41] . 

In the following, we use matrix notations compatible 
with Ref. m and denote matrices by capital bold, and 
vectors by small bold cases. Regression coefficients corre¬ 
sponding to training molecules, {cu}, have been obtained 
as solutions to 

(K -H AI) c, = pf _ pB = (3) 

where I and K are the identity and kernel matrices, re¬ 
spectively, the latter with elements Kgt = 

Note that in ML literature, the exponential kernel func¬ 
tion is also denoted as Laplace kernel, owing to the fact 
that the exponential function, in certain coordinate sys¬ 
tems, is a solution to Laplace’s equation. Eq. [^ mini¬ 
mizes the A-regularized (A quantifies the regularization 
strength) least-squares error in estimations [30] 

minimize 11 Ap)®^ - Apf * 11 ^ Ac^^Kc^, (4) 

Ci 

where || • II 2 stands for L 2 norm of a vector, (•)’’" denotes 
transpose operation, and Ap®®* is defined in Eq. (|^. 
Derivation of Eq. ([^ from Eq. Q is presented as an 
Appendix. 


B. Cross-validation 

Overfitting of the kernel models to training molecules 
is typically avoided by optimizing the two hyperparam- 
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eters {a, A) through five-fold cross validation (CV). In 
this procedure, N training molecules are randomly dis¬ 
tributed into 5 bins, each with N/5 molecules. Every 
bin is used once as a test (or validation) set, while the 
remaining four bins act as training sets. Hyperparame¬ 
ters are optimized such that they minimize the model’s 
MAE for the test-bin. Here, we employed Nelder-Mead’s 
simplex method [H] for the 2D optimization. The cross 
validation procedure is the most time-consuming process 
in training the ML model, with each evaluation of MAE 
of the test-bin requiring 0{v?) scaling matrix inversion 
operations, where n = AN/B. In the present work n is at 
most 8 k. This results in roughly one CPU day of training 
for a fully converged ML model. 

For larger training set sizes CVs become prohibitive, 
and one can employ the property-independent “single- 
kernel” Ansatz |31j . with optimal hyperparameters es¬ 
timated exclusively from the structures of the training 
molecules. This approach assumes the training data to be 
devoid of outliers, and enforces A to be a fixed, property- 
independent scalar (typically set , or close, to zero). The 
width of the kernel function can be chosen according to 
some heuristic, for example such that the maximal value 
of the off-diagonal elements of K is 1/2, which renders the 
kernel sufficiently global to have all training molecules 
contribute in the generation of the regression weights 
{cit}. For the exponential (aka Laplace) kernel, with 
Li distance metric, Kij = jd^ — Aj\/a, this constraint 
results in cr = max{|di — d^j} /log(2). In Ref. [311 we 
have demonstrated a global kernel derived in this fashion 
to enable systematic reduction of out-of-sample predic¬ 
tion errors for thirteen molecular ground-state properties 
of 112 k molecules, using up to 40 k training molecules. 
Here, in order to accelerate the CV procedure, we have 
made use of this heuristic as an initial guess. For the 
molecular datasets considered here these values in atomic 
units are typically cr = 1000, and A = 0. After CV, 
globally optimal hyperparameters have been obtained by 
taking the median of the 5 folds. A median value is con¬ 
sidered instead of mean, because the median of a distri¬ 
bution is not influenced by extreme values, such as the 
hyperparameters that could be found for a test bin with 
extreme outliers in structure or property. A final kernel 
with globally optimized hyperparameters is used for the 
prediction of the properties of out-of-sample molecules 
that are not part of training. 


C. Choice of molecular descriptor 

In order to assess the effect of the molecular repre¬ 
sentation on the ML model’s performance, we report re¬ 
sults based on two definitions of molecular representa¬ 
tions, namely, the Coulomb-matrix (CM) with atom in¬ 
dices sorted by norm of rows in order to reach invariance 
with respect to permutation of identical nuclei [20j . as 
well as a recently introduced more compact variant of 
the CM, called bag-of-bonds (BOB) [33]. The elements 


of the CM [3D] are defined as 

Mil = Mij = ZiZjIRij, (5) 

where /, and J are atomic indices, Rij is the inter¬ 
atomic distance, and Z is the atomic number. The off- 
diagonal elements of a CM uniquely represent the geom¬ 
etry and atomic composition of a molecule |44j . while 
the diagonal elements provide a simple exponential fit 
to the negative of the potential energy of the neutral 
atoms. As such, the diagonal is similar to the total po¬ 
tential energy of a neutral atom within Thomas-Fermi 
theory, Ftf = —or its modifications with a 
Z-dependent prefactor in the range 0.4-0.7 [45]. It is 
sufficient to consider only the lower or upper triangle 
of the CM. In order to enable comparisons between two 
molecules with different number of atoms, the CM matrix 
of the smaller molecule is padded with zero elements. 

BOB is a labeled set of off-diagonal CM elements 
which enables the comparison of pairwise distance be¬ 
tween any given combination of two atom types. For in¬ 
stance, for H 2 O, BOB is the set of two sorted row vectors, 
{[Mho, Mho] T > with elements corresponding to 

the CM entries. Due to the pairwise partitioning, how¬ 
ever, any two homometric molecules with identical stoi¬ 
chiometry will yield a zero descriptor difference according 
to BOB |33|. As such, BOB does not uniquely represent 
molecules. The CM, by contrast, is able to uniquely en¬ 
code any molecule, up to its enantiomers. The molecular 
dataset considered in this study, however, is devoid of ho¬ 
mometric molecules. In addition to the aforementioned 
sorted CM matrix, BOB has also been tested since it 
has been shown to yield slightly better accuracy for the 
prediction of molecular atomization energies [43] . In gen¬ 
eral, we have found that for large N, both CM and BOB 
converge towards similar prediction accuracy for energy- 
related properties. For smaller training sets, however, 
BOB typically exhibits a more substantial advantage. 

D. Excited states data 

We have relied on the recently published molecu¬ 
lar quantum chemistry database with relaxed geome¬ 
tries computed using the DFT B3LYP with basis set 6- 
3IG(2df,p) [35]. This data set corresponds to the small¬ 
est 133,885 (134 k) organic molecules with up to 9 CONF 
atoms out of the list of 166 B synthetically feasible or¬ 
ganic molecules, called GDB-17 database, and published 
by Reymond and co-workers [16] . For this study, we have 
eliminated 3,054 molecules from the 134 k dataset due 
to high steric strain in the B3LYP/6-31G(2df,p) geome¬ 
tries [46], and we further have limited ourselves to those 
molecules 21,800 molecules with only up to 8 CONF 
atoms. For these molecules we have performed single 
point calculations using the program TURBOMOLE m 
to compute the ground (So) [48], and the lowest two ver¬ 
tical electronic excited states (Si and S 2 ) of singlet spin- 
symmetry. We also performed calculations at the LR- 
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FIG. 1. Joined distributions of oscillator strength /i, and transition energy Ei for the first electronic excited singlet state of 
the 22 k organic molecules. All values correspond to the RICC2/def2TZVP level of theory. For select E\ values, representative 
molecules with large fi are shown as insets. 


TDDFT [ini level employing the hybrid XC functional 
PBEO [37l [39] with def2SVP basis set |49|, and at the 
resolution-of-identity approximate coupled cluster with 
singles and doubles substitution (RI-CC2) level with 
def2TZVP basis set [H]. Using the larger basis set, we 
also performed LR-TDDFT calculations with PBEO, and 
CAM-B3LYP [HI functionals, the latter using the pro¬ 
gram Gaussian09 m- 


All calculations were performed with Ci symmetry 
and in DET calculations default integral grids were em¬ 
ployed to compute the XC energy contributions. Eor 7 
molecules (most of them highly symmetric, e.g. cubane) 
the RI-CC2 calculations did not converged the first ex¬ 
cited state wavefunction. For 7 other molecules (with 
multiple CO groups, e.g. 2,3-dioxobutanedial) emmis- 
sion has been found, i.e. negative lowest transition en¬ 
ergy, presumably arising from orbital relaxation. For 
the purpose of this study, we have removed these ex¬ 
otic 14 molecules. The lowest two singlet transition en¬ 
ergies, as well as corresponding oscillator strengths in 
length-representation, have been used for the remaining 
21,786 molecules, to which we refer in the following as 
the set of 22 k molecules. All indices of the 22 k GDB-8 
molecules along with corresponding TDDFT, and CC2 
excitation energies are given as supplementary material 
in gdb8_22k_elec_spec. txt. The indices enable retrieval 
of geometries from the 134k GDB-9 dataset jlB] . 


III. RESULTS AND DISCUSSION 

A. Excitation energies and oscillator strengths for 
22 k organic molecules 

The smoothened distribution of CC2 predicted Sq —tSi 
transition energies Ei and corresponding oscillator 
strengths /i features in Figure for all 22 k molecules. 
This 2D count density has been computed via kernel- 
density estimation |S11 [S3]. The first excitation energy 
distribution is bimodal (see also Fig. for the ID pro¬ 
jection), corresponding to one Gaussian centered at 0.18 
a.u. with small variance, and another centered near 0.26 
a.u. with significantly larger variance (the shoulder pos¬ 
sibly implying two peaks, rather than one broad peak). 
Collectively, the 22 k molecules span the spectral range 
of UV-B and UV-C, with few molecules in the UV-A 
region (> 300 nm or < 0.15 a.u.). The lack of transi¬ 
tions in the visible region is consistent with the fact that 
small organic molecules typically exhibit an energy gap 
of > 5 eV between highest and lowest occupied molec¬ 
ular orbital, HOMO and LUMO respectively. Not sur¬ 
prisingly, when proceeding from low to high transition 
energy regions one notices that molecules gradually turn 
from being aromatic, or highly unsaturated, into increas¬ 
ingly saturated structures. The oscillator strength (/i), 
by contrast, exhibits an exponentially decaying distribu¬ 
tion, with the largest fraction of compounds in the 22 k 
set having negligible or zero values. A small minority of 
molecules, however, have significant/i-magnitude, imply- 
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ing potential usefulness of these molecules as components 
in metal-free organic sensitizers [54]. About a dozen 
molecules, highlighted in Figure display /i > 0.5, re¬ 
sulting in light harvesting efficiencies [^ larger than 
100 X (1 — 10“°'®) « 68%. These molecules contain 
ketoxime, R(R')C=NOH, or amidine, R-C(NH 2 )=NR', 
chromophores. They all exhibit push-pull type conju¬ 
gation of TT-bonds, with electron-donating, and electron- 
withdrawing groups on opposite ends, resulting in highly 
polarized electron densities. However, also the sym¬ 
metric molecule (point group C 2 h), dimethylglyoxime, a 
chelating agent commonly used in gravimetric analysis of 
nickel, has a large oscillator strength for its first excita¬ 
tion with /i = 0.56 at Ei = 0.2 a.u. 

The effect of level of theory is shown for TDPBEO and 
CC2 predictions of Ei and E 2 in the top panel of Fig¬ 
ure For both states, TDDFT leads to a depletion in 
count densities at w 7 eV when compared to the CC2 
distribution, compensated by overestimated densities in 
low and high energy regions. In the following we will 
discuss how to mitigate this depletion using ML models. 


B. ML models of excitation energies 

Despite the obvious differences in prediction in the top 
panel of Fig. the A-ML model of Eq. captures the 
necessary correction. This is illustrated by the signed 
error distributions (with respect to CC2) in the bottom 
panel of Figure]^ for both excitation energies. Distribu¬ 
tions are shown for A-ML models trained on molecular 
sub-sets containing either A=lkorA = 5k molecules, 
drawn at random from the 22 k data set. All ML re¬ 
sults discussed in this paper, including these distribu¬ 
tions, correspond to out-of-sample predictions for the 
remaining (22 k - N) molecules. For comparison, the 
TDDFT deviation from CC2 is also shown in the bot¬ 
tom panel of Fig.j^for both transition energies, resulting 
in a bimodal distribution which suggests that system¬ 
atic errors are present. These errors are can be either 
due to PBEO kernel, or smaller basis-set, or both. The 
ML errors, by contrast, are normally distributed around 
zero, with increasing and decreasing height and width, 
respectively, as one increases the training set from 1 k 
to 5 k. This implies that the A-ML model is properly 
accounting for the systematic errors in the TDDFT pre¬ 
dictions, replacing them by a normal error distribution. 
Mean absolute errors (MAEs) of the TDDFT predictions 
amount to of 0.27, and 0.37 eV, for Ai, and A 2 , respec¬ 
tively. These MAEs are reduced to 0.16, and 0.23 eV 
for the Ik ML models, and to 0.13 and 0.20 eV for the 
5 k ML models. We have also investigated the effect of 
the other aforementioned descriptor in the ML model, 
BOB. BOB results in ML prediction errors of 0.13/0.20 
and 0.09/0.16 eV for E 1 /E 2 , using models trained on Ik 
and 5 k training sets, respectively— slightly better than 
the corresponding CM predictions. 

In order to investigate in a systematic fashion the per- 
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FIG. 2. Distributions, smoothened by ID kernel density es¬ 
timation as implemented in GNUPLOT m, of spectral prop¬ 
erties and predicted errors. Top: Densities of first, and second 
singlet transition energies {Ei, and E 2 , respectively, in eV) of 
17 k organic molecules with up to eight GONF atoms, at the 
GG2, and TDPBEO levels of theory. Bottom: Error distribu¬ 
tion for El (left) and E 2 (right) with respect to GG2. Errors 
are given for TDPBEO and A-ML models based on 1 k, and 
5 k training molecules with TDPBEO baseline. 


formance of the A-ML model, we have calculated out- 
of-sample MAEs of El predictions for various baseline 
methods. In Figure]^ the resulting MAEs are shown as a 
function of training set size N for N = 0 (i.e. the error of 
the baseline method), 10, 100, Ik, 2k, 3k, 4k, 5k, and 
10 k. More specifically, zero baseline results correspond 
to setting Ef to zero in Eq. Q. We also used the PBEO 
HOMO-LUMO gap as a baseline, as well as TDPBEO 
and TDCAM-B3LYP. As one would expect, the predic¬ 
tive accuracy improves with increasing level of sophisti¬ 
cation of the baseline: The zero, gap, and TD baseline 
with def2SVP basis set yield 0.4, 0.3, and 0.13 eV, re¬ 
spectively, for the most accurate model trained on N = 
10 k molecules. Increasing the basis set from def2SVP to 
def2TZVP improves PBEO’s baseline value, eventually 
resulting in a very small MAE of 0.08 eV for 10 k A- 
ML. These observations are in line with previous bench¬ 
mark calculations [57] which concluded TDCAM-B3LYP 
is somewhat inferior to TDPBEO for the prediction of 
singlet-to-singlet excitation energies of small molecules. 
Overall, it is encouraging that all models, no matter 
which baseline, converge towards the same learning rate, 
i.e. slope on the log-log scale of error versus training set 
size. As such, the baseline merely leads to a difference in 
off-sets — which could also be compensated for by adding 
more training data. Due to the immense size of chemical 
space the addition of more molecules can easily be 
envisioned. For A-ML models of E 2 , similar curves can 
be obtained, albeit slightly off-set yielding less accurate 
predictive power. 
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FIG. 3. Systematic improvement of ML models of singlet-singlet transition energies (Ei). Mean absolute error (MAE in 
eV) with respect to reference CC2/def2TZVP values is shown as a function of trainingset size (N) for 22 k—A out-of-sample 
predictions. Various baseline methods are shown. The value of the baseline-free ML model (red) at V = 0 corresponds to the 
CC2 standard deviation in the 22k test set. Other baselines include HOMO-LUMO gap (blue), TDPBEO Ei without (yellow), 
and with (green) bivariate systematic shift corrections which explicitly account for a and vr chromophores. Also included are 
def2TZVP baseline results for TDCAM-B3LYP Ei (black) and TDPBEO Ei (blue). Baseline errors at A = 0 correspond to 
standard deviations, obtained after subtraction of an average shift with respect to the CC2-targetline. 


C. Inclusion of systematic shifts 

It is not obvious to us that there is a single reason for 
TDPBE0/def2SVP’s substantial underestimation of first 
and second transition energies near 7 eV, see Figure A 
simple pattern, however, emerges after splitting the 22 k 
set into saturated and unsaturated molecules, i.e. into 
two sets containing either tt- or cr-chromophores. The 
corresponding signed error densities for the two sets are 
well separated, as shown for Ei in Figure They are 
centered around -0.31, and -1-0.19 eV for the saturated a 
and unsaturated 7r-chromophores, respectively. The sys¬ 
tematic underestimation of TDPBFO-based Ei of 7r-type 
excitations (tt — > tt* or n —> tt*) is a well-known issue 
of approximate XC functionals when it comes to the de¬ 
scription of CT-type excitations [T3], i.e. transitions with 
small overlap between donor and acceptor orbital over¬ 
lap [SH]. Our results are consistent with this finding, 
strengthening the indications that the underestimation 
of El is universal for all 7r-type excitations. Further¬ 
more, the other distribution in Figure clearly shows 
a systematic overestimation of TDPBFO-based Ei of a- 
type excitations (cr u* or n —>■ cr*). This systematic 
blue shift of TDPBFO Ei is, at least partly, due to the 
finiteness of the relatively small basis set (def2SVP) used. 
This reasoning is in line with the variational principle: 
The difference between the lowest two eigenvalues of the 
molecular Hamiltonian is always larger when represented 
in a small basis set than when compared to the complete 


basis set limit. For instance, using literature values [5l] 
of the HOMO-LUMO gap of the water molecule, we note 
the PBEO value with the minimal basis set, STO-3G, 
to be 13.3 eV, overestimating more converged basis set 
PBEO results by roughly 4.6 eV. 



FIG. 4. Bivariate error distribution of the 

TDPBE0/def2SVP lowest singlet-singlet transition energies 
{El in eV) of 22 k organic molecules with up to eight GONF 
atoms (yellow). Partitioned error distributions over saturated 
(blue) and unsaturated (red) molecules are shown as well. 
The molecular structures correspond to extreme outliers for 
TDPBE0/def2SVP. 
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The degree of saturation can easily be detected be¬ 
forehand using SMILES strings. We can therefore read¬ 
ily exploit this knowledge by subtraction of the distri¬ 
bution’s centered value -0.31, and -1-0.19 eV from the 
baseline number for saturated and unsaturated chro- 
mophores, respectively. The resulting TDPBEO A-ML 
model in Eq. 0 improves indeed: The out-of-sample 
MAE decreases at a lower off-set with training set size, 
as shown in Fig. (|^, yet at similar learning rates as the 
other models. For the 10 k model of the Ei transition 
energy the MAE is found to decrease from 0.13 eV to 
0.1 eV. It is intersting to note that the performance of 
the TDCAM-B3LYP/def2TZVP level is virtually identi¬ 
cal with the shifted TDPBE0/def2SVP result (iV = 0), 
as well as for larger N values. For smaller training 
sets {N = 10 or 100), the shifted TDPBE0/def2SVP A- 
ML model even outperforms the corresponding TDCAM- 
B3LYP/def2TZVP variant. 


D. DFT and ML model outliers 

It is always interesting to consider the worst predic¬ 
tions of a model. The average errors discussed so far nei¬ 
ther imply better ML predictions for DFT outliers nor 
do they quantify the ML outliers. Here, we briefly dis¬ 
cuss the accuracy of predicted Ei for the 10 most ex¬ 
treme outliers among all out-of-sample molecules, i.e. all 
molecules that were not part of the training sets for the 
Ik, and 5k A-ML models. Table |l] lists SMILES strings 
of corresponding molecules, model prediction errors, and 
CC2 numbers for comparison. The 10 outliers are sorted 
by their TDPBEO, Ik, or 5k ML model deviation. As 
also already indicated in Figure the worst DFT out¬ 
liers correspond to unsaturated molecules. This obser¬ 
vation holds true for the 10 most extreme DFT outliers 
in Table H) deviating by up to 2.15 eV from CC2. These 
molecules could be of interest as benchmarks for develop¬ 
ing improved DFT kernels for TDDFT calculations. The 
numbers in Table |l] show that for all outliers, the 5 k ML 
model yields better performance than DFT, while the 1 k 
ML model improves all predictions but the one for the 
worst, namely cyclopenta-l-en-4-on (0=C1CC=CC1). This 
molecules is also shown in Fig.[^ Note that other outliers 
shown in that figure have been part of the training set and 
therefore do not feature in Table |Ij The finding that the 
ML models also improve on the baseline method’s out¬ 
liers agrees with conclusions drawn in a previous findings 
where we applied the A-ML Ansatz to model DFT-level 
enthalpies of atomization for the 134 k dataset, and where 
we found that for the most extreme outlier the baseline 
model’s error reduced systematically with the training 
set size of the augmenting ML model [531 • 

When considering the 10 most extreme outliers of the 
ML models in Table |ll neither order nor identity of the 
DFT outliers is conserved. Among the top 10 outliers 
of the 10 k model, for example, there is even a saturated 
molecule from the opposite (blue) end of the error dis¬ 


tribution in Fig. [^ tetra-fluoro-methane CF 4 , with an 
underestimating deviation -1.27 eV. 

TABLE I. 10 most extreme outliers for TDPBE0/def2SVP 
and A-ML models. Largest deviations of predicted low¬ 
est singlet-singlet transition energy {E\) from corresponding 
CC2/def2TZVP value. All values in eV. 


Molecule TDPBEO Ik A-ML 5k A-ML CC2 


top DFT outliers 
FC1=C0C=NC1=0 

1.63 

1.41 

1.40 

5.85 

CC1=C0C=CC1=0 

1.64 

1.22 

1.04 

5.69 

CC1=CC(=D)C=N01 

1.66 

1.08 

0.88 

5.23 

CC1=C(NN=N1)C=D 

1.73 

1.55 

1.51 

5.57 

0=CC1=CN=CN=C1 

1.81 

1.50 

1.42 

5.38 

CC1=C(C)CC(=0)C1 

1.82 

1.62 

1.56 

6.12 

CN1C=C(C=0)C=N1 

1.84 

1.62 

1.35 

5.82 

CC1=C(C=D)N=N01 

1.92 

1.52 

1.74 

5.82 

C#CC1=NC=CN=N1 

1.99 

1.43 

1.76 

5.02 

0=C1CC=CC1 

2.13 

2.15 

1.95 

6.44 

top Ik A-ML outliers 
0=N(=0)C1=NC=CD1 

1.53 

1.33 

1.41 

5.31 

FC1=C0C=NC1=0 

1.63 

1.41 

1.40 

5.85 

C#CC1=NC=CN=N1 

1.99 

1.43 

1.76 

5.02 

CC(=D)C1=CC=NN1 

1.62 

1.46 

0.91 

5.55 

0=CC1=CN=CN=C1 

1.81 

1.50 

1.42 

5.38 

CC1=C(C=0)N=ND1 

1.92 

1.52 

1.74 

5.82 

CC1=C(NN=N1)C=D 

1.73 

1.55 

1.51 

5.57 

CC1=C(C)CC(=D)C1 

1.82 

1.62 

1.56 

6.12 

CN1C=C(C=0)C=N1 

1.84 

1.62 

1.35 

5.82 

0=C1CC=CC1 

2.13 

2.15 

1.95 

6.44 

top 5k A-ML outliers 
0=N(=0)C1=NC=CD1 

1.53 

1.33 

1.41 

5.31 

FC(F)(F)F 

- 1.20 

-1.27 

-1.42 

13.98 

0=CC1=CN=CN=C1 

1.81 

1.50 

1.42 

5.38 

0=CC1=NC=CC=C1 

1.58 

1.31 

1.46 

5.09 

CC1=C(NN=N1)C=D 

1.73 

1.55 

1.51 

5.57 

0C1=N0C(C=0)=C1 

1.59 

1.32 

1.54 

5.18 

CC1=C(C)CC(=0)C1 

1.82 

1.62 

1.56 

6.12 

CC1=C(C=0)N=N01 

1.92 

1.52 

1.74 

5.82 

C#CC1=NC=CN=N1 

1.99 

1.43 

1.76 

5.02 

0=C1CC=CC1 

2.13 

2.15 

1.95 

6.44 


E. ML models of oscillator strengths 

We have also investigated the applicability of the A- 
ML Ansatz to model oscillator strengths, /i and /2 for 
So —t Si and So —t S 2 transitions, respectively. While 
the A-ML models of excitation energies can be system¬ 
atically improved through mere addition of training data, 
corresponding models for fi or /2 do not become more ac¬ 
curate with increasing training set size. TDCAM-B3LYP 
has been show to yield oscillator strengths with minimal 
deviations with respect to correlation TD methods [60j . 
For our 22 k dataset, TDCAM-B3LYP/def2TZVP yields 
an MAE of 0.0101 a.u., compared to CC2/def2TZVP. 















This deviation is reduced to only 0.0100, and 0.0099 a.u. 
when augmenting the CAM-B3LYP numbers with A-ML 
models trained on 1 k, and 5 k molecules, respectively. 
Also changing the descriptor from CM to BOB did not 
improve the state of affairs. 

The A-ML model approach might fail for several rea¬ 
sons. For one, fi is a rather complex property which 
requires knowledge of a certain combination of two wave- 
functions, 

/,« |(0|A|*)pA,. (6) 

This could imply the need for substantially larger train¬ 
ing sets in order to obtain satisfying learning curves. An¬ 
other explanation might be that the training problem is 
ill posed. In fact, TDDFT often yields a different order¬ 
ing of states than CC2, implying that the baseline prop¬ 
erty corresponds to a different matrix element than the 
targetline property. This, in turn, will also result in sub¬ 
stantially less efficient ML training scenarios. However, 
this reasoning, while appealing to explain the failure of 
a A®§Qprp-ML model, does not satisfyingly explain why 
also a direct ML model with zero baseline shows insignifi¬ 
cant prediction improvement with increasing training set 
size. Finally we remark that also previously we have seen 
significantly less impressive learning rates for other elec¬ 
tronic integrals, e.g. the norm of the molecular dipole 
moment in organic molecules m- 

IV. CONCLUSIONS 

In summary, we have applied the A-ML approach, pre¬ 
viously introduced to accurately model molecular ground 
state properties, to the data-driven modeling of electronic 
excitation energies. We have computed the low-lying va¬ 
lence electronic spectra for a modest chemical universe 
of 22 k organic molecules, made up from up to 8 CONF 
atoms, at the level of TDDFT (using PBEO, and CAM- 
B3LYP), and CC2. We have presented numerical evi¬ 
dence that large basis set CC2-level valence excitation 
energies can be estimated at the speed of small basis set 
TDPBEO through statistical inference of the difference, 
derived from training on a fraction of this database. 

Analysis of the data-sets, based on kernel density es¬ 
timates, suggests small basis set TDPBEO level of the¬ 
ory to over-, and under-estimate the lowest two tran¬ 
sition energies for organic molecules with a-, and tt- 
chromophores, respectively. This behavior results in well 
separated bivariate error distribution. Accounting for 
these systematic shifts enables further improvement of 
the A-ML models. From a methodological point of view, 
this procedure allows to readily integrate expert knowl¬ 
edge of error distributions in the ML model, resulting 
in improved predictions. For an automated estimation 
of systematic shifts arising from multivariate property 
distributions, one can adapt clustering protocols based 
on kernel density estimates. Such clustering has been 
done previously in the context of analyzing Monte Carlo 


trajectories EH, collective variables in molecular dynam¬ 
ics [62j[63], or even to quantify the contribution of an MO 
to total electronic energy |M] . 

The numerical evidence for the modeling of excitation 
energies suggests that severe flaws in TDDFT based pre¬ 
dictions can easily be rectified through statistical learn¬ 
ing, irrespective of their origin such as possible incorrect 
state ordering, basis set incompleteness, inherent limita¬ 
tions of adiabatic TDDFT for states with doubly excited, 
or CT character. 

The poor performance of ML models for predicting 
oscillator strengths warrants future investigations. The 
database of excited states properties for 22 k organic 
molecules (see Supporting Information) might also be 
useful for benchmarking the performance of other ap¬ 
proximations and models, as well as to facilitate the iden¬ 
tification of potential, hitherto unknown, chromophore- 
auxochrome relationships. Eventually, our study might 
aid the computational design of functional molecular 
components with desirable photochemical properties. 


V. ACKNOWLEDGEMENT 

OAvL acknowledges funding from the Swiss National 
Science Foundation (No. PP00P2_138932). ET acknowl¬ 
edges start-up funds from California State University 
Long Beach. Some calculations were performed at sci- 
CORE (http://scicore.unibas.ch/) scientific computing 
core facility at University of Basel. This research used 
resources of the Argonne Leadership Computing Facil¬ 
ity at Argonne National Laboratory, which is supported 
by the Office of Science of the U.S. DOE under contract 
DE-AC02-06CH1I357. 


APPENDIX: DERIVATION OF EQ.j^IN MATRIX 
NOTATION 

We derive the linear system of equations in Eq. iby 
employing the regularized least squares error measure, 
Eq.a Let us denote the reference property values of 
training molecules as the column vector x = The 

Kernel-Ridge-Regression Ansatz for the estimated prop¬ 
erty values of training molecules is p®®* = Kc. The L 2 - 
norm of the residual vector, penalized by regularization 
of ht coefficients, is the Lagrangian 

£ = ||p>^«f-p««‘||2 + AcTKc 
= (x — Kc)"*" (x — Kc) -I- Ac'^Kc 
= x'^x — x'^Kc — (Kc)'^x -I- (Kc)’^Kc -|- Ac'^Kc, 

(7) 

where (•)"'" denotes transpose operation. To minimize the 
Lagrangian, we equate its derivative with respect to the 
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regression coefficients vector, c, to zero 

—£ = -x^K - Kx + KKc + c^KK+ 
dc 

AKc + Ac'^K = 0. (8) 

Here we have used the fact that the kernel matrix K is 
symmetric, i.e., K"’’ = K along with the matrix calculus 
identity, {d/dc) c"'’ = I, where c is a column vector and 
c"”" is a row vector. Grouping by row and column vectors 


yields 

(KKc + AKc - Kx) + (KKc + AKc - Kx)^ = 0, 

(9) 

which is satisfied, iff 

(KKc + AKc - Kx) = 0. (10) 

Multiplication with K~^ from the left, and rearranging 
results in Eq. 

VI. SUPPLEMENTARY INFORMATION 

Indices of the 22 k GDB-8 molecules, to retrieve their 
geometries from the 134 k GDB-9 dataset |46], along with 
TDDFT, and CC2 excitation energies are collected in 
gdb8_22k_elec_spec.txt. 
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